The first step we need to take in order to apply distributional semantics to our texts is to design a script that counts the number of co-occurrences for each co-occurrent $c$ within a certain range of each target word $t$. This notebook will lead you through this process step by step.
First, we need to write a function that takes as input the complete file path of a text file, breaks the text down into an ordered list of words, and returns it as, well, a list. You did this in exercise 1b, so you should re-use your code as much as possible here.
In [1]:
from string import punctuation
import re

def txt_to_list(filename):
    # insert your code here
    with open(filename) as f:
        words = [word.lower() for word in f.read().split()]
    # split each word on runs of punctuation and drop the empty strings
    # left behind; re.escape keeps characters like ']' from breaking the
    # regex character class
    l = []
    for word in words:
        l.extend(w for w in re.split('[%s]+' % re.escape(punctuation), word)
                 if w != '')
    return l
# Test your code on this short text. Make sure to look at the results!
tokens = txt_to_list('austen-emma-excerpt.txt')
print(tokens)
OK, that should have been easy. The next step is also a short one: write another function that takes the list returned by the txt_to_list function and produces a dictionary where the keys are the individual types in the text and the values are the total counts of each type in the text. Such a count dictionary will be necessary for our later calculations.
Hint: Using Counter here will simplify the task significantly.
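If Counter is new to you, here is a quick sketch of what it does (the toy word list is made up for illustration):
from collections import Counter
Counter(['the', 'cat', 'sat', 'on', 'the', 'mat'])
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})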
In [2]:
from collections import Counter

def make_count_dict(l):
    # insert your code here
    return Counter(l)
# Below tests your code. Again, make sure to check your results.
count_dict = make_count_dict(tokens)
print(count_dict)
Now, the next step will be a bit more complex. We want to write a function that takes as input a token list and a window size and returns a dictionary of dictionaries: the keys of the outer dictionary are the target words $t$ (the members of your token list), the keys of each inner dictionary are the co-occurrents $c$ (again, members of your type list), and the values of the inner dictionary are the number of times that $c$ co-occurs within the window around $t$. Mathematically, this value is $n(c,t)$.
In the end, your dictionary should look something like this: {'the': {'the': 1000, 'aardvark': 8, 'be': 100, ...}}.
Hint: Consider using a defaultdict and a Counter for this.
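To see why that combination helps, here is a minimal sketch (the words are invented for illustration): a defaultdict(Counter) hands you an empty Counter for any target word it has not seen yet, and += merges in new window counts.
from collections import defaultdict, Counter
d = defaultdict(Counter)
d['the'] += Counter(['cat', 'sat'])   # no KeyError: an empty Counter is created first
d['the'] += Counter(['cat'])
# d['the'] is now Counter({'cat': 2, 'sat': 1})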
In [3]:
from collections import defaultdict

def make_cooc_dict(l, window_size=4):
    '''Takes as input a token list and a window size (default == 4).
    The window size is the distance in words both left and right from the
    target word. For instance, if you want 4 words left and 4 words right
    of your target word, window_size = 4.
    '''
    # insert your code here
    d = defaultdict(Counter)
    for i, word in enumerate(l):
        # the window is every token up to window_size positions to the left
        # and right of the target, excluding the target token itself
        window = l[max(i - window_size, 0):i] + l[i + 1:i + window_size + 1]
        d[word] += Counter(window)
    return d
# Below tests your code. Check your results.
cooc_dict = make_cooc_dict(tokens, window_size=4)
# the following lines check to make sure that your cooc_dict is symmetrical
problems = [(x, y) for y in cooc_dict
            for x in cooc_dict[y]
            if cooc_dict[x][y] != cooc_dict[y][x]]
print(problems)
# the following line checks one tough case
cooc_dict['i'] == Counter({'chapter': 2, 'volume': 2, '1816': 2, 'woodhouse': 2,
                           'emma': 2, 'i': 2, 'handsome': 1, 'austen': 1,
                           'jane': 1, 'clever': 1})
Out[3]:
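If you want to convince yourself that the windowing behaves as intended, run the function on a toy list (the three-token list below is made up; with window_size=1, each token only sees its immediate neighbours):
make_cooc_dict(['a', 'b', 'a'], window_size=1)
# the result maps 'a' to Counter({'b': 2}) and 'b' to Counter({'a': 2});
# note that it is symmetrical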
We have been using dictionaries up to this point instead of Pandas Series and DataFrame objects because the former are much more memory efficient than the latter. We should only switch over to Pandas objects when we want to start vectorizing our calculations. That is the point at which the extra memory the Pandas objects consume pays for itself in speed!
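When you do make the switch, the conversion is a single call; for instance, here is a sketch that converts the cooc_dict built above (fillna(0) replaces the missing target/co-occurrent combinations with zero counts):
import pandas
cooc_df = pandas.DataFrame(cooc_dict).fillna(0)   # columns are targets t, rows are co-occurrents c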
Quiz:
Now it is time to put your functions to the test with your own texts. If you did not bring your own texts to the summer school with you, use the texts in the Data folder for lesson 1b. Below, write a script that goes to the folder on your computer where your text files are, returns a list of the names of all the .txt files in that folder (Hint: Check out the os.listdir() function to help with this), and then runs each text through each of the functions you wrote above. Finally, convert both your dictionaries into Pandas objects (you decide which type is best for each dictionary) and save them as .pickle files using the df.to_pickle() method.
As background, pickle serializes your Python objects, which basically means that it saves them as Python objects, e.g., it will save your dictionaries as Python dictionary objects. This is typically more efficient, in both disk storage space and processing time, when saving the objects and reloading them back into Python.
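For instance, a round trip looks like this (a sketch using the cooc_df from the conversion above; the file name is made up, and pandas.read_pickle is the companion function for loading):
cooc_df.to_pickle('example.cooc.pickle')
cooc_df = pandas.read_pickle('example.cooc.pickle')   # comes back exactly as it was saved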
Hint: You might also want to check out the tkinter.filedialog functions. They open a file-open or file-save dialog so that you can choose the files you want to work with on the fly. They are great tools for easing file interaction and for generalizing the code you write across different purposes and different operating systems.
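A minimal sketch using the standard tkinter calls (the withdraw() call hides the bare root window that tkinter would otherwise open):
from tkinter import Tk
from tkinter.filedialog import askopenfilenames

root = Tk()
root.withdraw()   # hide the empty root window
paths = askopenfilenames(filetypes=[('Text files', '*.txt')])
root.destroy()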
Hint #2: If you are running out of memory when producing your Pandas objects or pickling them, try using del to remove objects that you don't need any more. For instance, once you have run the make_count_dict and make_cooc_dict functions, you don't need tokens anymore. So type:
del tokens
Do the same with your dictionaries once you have converted them to Pandas objects, and with your Pandas objects once you have pickled them.
In [4]:
from glob import glob
from os.path import basename
import pandas

def process():
    # insert your code here
    for filename in glob('./Data/*.txt'):
        print(filename)
        tokens = txt_to_list(filename)
        # a Series suits the flat count dictionary,
        # a DataFrame the nested co-occurrence dictionary
        pandas.Series(make_count_dict(tokens)).to_pickle(
            './Data/%s.count.pickle' % basename(filename)[:-4])
        pandas.DataFrame(make_cooc_dict(tokens)).fillna(0).to_pickle(
            './Data/%s.cooc.pickle' % basename(filename)[:-4])
        del tokens  # free memory before the next file (see Hint #2)

process()